import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from statsmodels.graphics.tsaplots import plot_acf
import seaborn as sns
import scipy.stats as stats
import bamboolib as bam
import statsmodels.api as sm
data = pd.read_csv('https://raw.githubusercontent.com/JaiminBrahmbhatt/MidsemQue/main/data.csv')
data
Normal
Continuous Uniform
None of these (Neither)
So what is normally distributed data?
A normal distribution is a common probability distribution. Its shape is often referred to as a "bell curve."
Many everyday data sets approximately follow a normal distribution: for example, the heights of adult humans, the scores on a test given to a large class, and errors in measurements.
The normal distribution is always symmetrical about the mean.
The standard deviation is the measure of how spread out a normally distributed set of data is. It is a statistic that tells you how closely all of the examples are gathered around the mean in a data set. The shape of a normal distribution is determined by the mean and the standard deviation. The steeper the bell curve, the smaller the standard deviation. If the examples are spread far apart, the bell curve will be much flatter, meaning the standard deviation is large.
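To make the link between spread and standard deviation concrete, here is a minimal sketch on synthetic data (not the dataset loaded above). The seed, the sample sizes and the helper name share_within are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Two synthetic samples with the same mean but a different spread.
narrow = rng.normal(loc=0.0, scale=1.0, size=100_000)
wide = rng.normal(loc=0.0, scale=3.0, size=100_000)

def share_within(sample, k):
    """Fraction of values within k sample standard deviations of the sample mean."""
    mu, sigma = sample.mean(), sample.std()
    return np.mean(np.abs(sample - mu) <= k * sigma)

for name, sample in [("narrow", narrow), ("wide", wide)]:
    shares = [round(float(share_within(sample, k)), 3) for k in (1, 2, 3)]
    print(name, "std:", round(float(sample.std()), 2), "shares within 1, 2, 3 std:", shares)
```

For normally distributed data the shares should come out close to 68%, 95% and 99.7%, whatever the standard deviation is. Only the width of the bell changes.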
To check how a column is distributed, we can bin the values and plot their frequencies as a histogram, or we can plot a kernel density estimate (KDE).
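As a rough illustration of the two approaches, the sketch below computes a binned density and a KDE numerically on a synthetic sample. The sample, the bin count and the grid size are assumptions, not values taken from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=2.0, size=2_000)  # synthetic stand-in for one column

# Histogram: counts per bin, normalised so that the bars integrate to 1.
density, edges = np.histogram(sample, bins=30, density=True)
bin_area = np.sum(density * np.diff(edges))

# KDE: a smooth density estimate evaluated on a regular grid.
kde = stats.gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 200)
kde_values = kde(grid)
peak_location = grid[np.argmax(kde_values)]

print("area under histogram:", bin_area)
print("KDE peak near:", round(float(peak_location), 2))
```

The histogram depends on the chosen number of bins, while the KDE depends on its bandwidth (here scipy's default), so it is worth looking at both before drawing conclusions.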
So here we calculate mean, standard deviation and z-score
data
data.mean()
data.std()
Calculating the z-score of the whole dataset along axis = 0 (i.e., column by column)
datazscore = data.apply(stats.zscore)
datazscore
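For reference, the z-score of a value is (value - column mean) / column standard deviation. The sketch below checks this by hand on a small made-up frame (toy is a hypothetical name). One detail to keep in mind: scipy's zscore uses the population standard deviation (ddof=0) by default, while pandas' std() uses the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small made-up frame, only for illustration.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 30.0, 30.0]})

# Manual z-score per column, using ddof=0 to match scipy's default.
manual = (toy - toy.mean()) / toy.std(ddof=0)
scipy_z = toy.apply(stats.zscore)

# Same formula with pandas' default ddof=1 gives slightly smaller magnitudes.
manual_ddof1 = (toy - toy.mean()) / toy.std()

print(np.allclose(manual, scipy_z))
print(manual_ddof1)
```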
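So here we see that the y column does not get valid z-scores (with scipy's default nan_policy='propagate', a single NaN turns the whole column's z-scores into NaN), which implies that column Y has a missing value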
data.isnull().sum()
It is better to fill in the missing value. Here I use the column mean, as there is only 1 missing value in the dataframe
data['y'] = data['y'].fillna(data['y'].mean())
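The effect of the missing value and of mean imputation can be sketched on a tiny made-up series (toy is a hypothetical name, not part of the dataset). It assumes scipy's default nan_policy='propagate'.

```python
import numpy as np
import pandas as pd
from scipy import stats

toy = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], name="y")

# With the default nan_policy='propagate', one NaN spoils every z-score in the column.
z_with_nan = stats.zscore(toy.to_numpy())

# Mean imputation: fill the gap with the mean of the observed values.
filled = toy.fillna(toy.mean())
z_filled = stats.zscore(filled.to_numpy())

print(z_with_nan)
print(filled.tolist())
print(z_filled)
```

Mean imputation keeps the column mean unchanged but slightly shrinks its variance, which is usually acceptable when only one value is missing.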
Calculating the z-score of the whole dataset again along axis = 0 (i.e., column by column)
datazscore = data.apply(stats.zscore)
datazscore
px.box(data, points='all', template='presentation')
Plotting a kernel density estimate (KDE) of each column gives a smoothed view of how its values are distributed
ax = data.plot.kde()
From these distributions we can say that:
w = neither normal nor continuous uniform
x = neither normal nor continuous uniform
y = normal distribution
z = normal distribution
px.histogram(datazscore, x='w' ,template='ggplot2')
Column W seems to be neither normally distributed nor continuous uniform
px.histogram(datazscore, x='x' ,template='ggplot2')
Column X seems to be neither normally distributed nor continuous uniform
px.histogram(datazscore, x='y' ,template='ggplot2')
Column Y seems to be normally distributed
px.histogram(datazscore, x='z' ,template='ggplot2')
Column Z seems to be normally distributed
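The visual reading above could be cross-checked with formal tests. The sketch below is not the method used in this notebook: it swaps in scipy's D'Agostino-Pearson normaltest and a Kolmogorov-Smirnov test against a uniform distribution. The helper classify, the 0.05 threshold and the evenly spaced quantile samples are all assumptions for illustration. The KS step is only approximate because the uniform bounds are estimated from the data.

```python
import numpy as np
from scipy import stats

def classify(values, alpha=0.05):
    """Hypothetical helper: label a sample as 'normal', 'uniform' or 'neither'."""
    values = np.asarray(values, dtype=float)
    # D'Agostino-Pearson test: a small p-value rejects normality.
    p_normal = stats.normaltest(values).pvalue
    # KS test against a uniform on [min, max] (approximate: bounds come from the data).
    lo = values.min()
    width = values.max() - lo
    p_uniform = stats.kstest(values, "uniform", args=(lo, width)).pvalue
    if p_normal >= alpha:
        return "normal"
    if p_uniform >= alpha:
        return "uniform"
    return "neither"

# Deterministic stand-ins built from evenly spaced quantiles, not real data.
q = (np.arange(1, 1001) - 0.5) / 1000
normal_like = stats.norm.ppf(q)
uniform_like = np.linspace(0.0, 1.0, 1000)
skewed_like = stats.expon.ppf(q)

for name, sample in [("normal_like", normal_like),
                     ("uniform_like", uniform_like),
                     ("skewed_like", skewed_like)]:
    print(name, "->", classify(sample))
```

On the real columns one could call classify(data['w']) and so on, keeping in mind that with many observations these tests reject even small departures from the ideal shape.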
So here we start with plotting the data for each column w, x, y and z
px.line(data , template='presentation')
px.line(y=data['w'], template='presentation' , title='Plot for Column W')
Looking at the graph, we can directly see some sort of periodicity in column W
The series takes about 100 instances to complete one cycle. With regular sampling at 30 Hz, that means it repeats roughly every 100 / 30 ≈ 3.33 seconds. In total, the series contains 5 cycles
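The conversion from instances to seconds, and one way to estimate the period without reading it off the plot, can be sketched as follows. The sine wave is a synthetic stand-in for column W, and the series length of 500 is an assumption (5 cycles of 100 instances). The 30 Hz rate is the one quoted above.

```python
import numpy as np

sampling_rate_hz = 30.0   # sampling rate quoted in the text
n_samples = 500           # assumed series length: 5 cycles of 100 instances
true_period = 100         # instances per cycle, as read off the plot

# Synthetic stand-in for column W: a clean sine wave with that period.
t = np.arange(n_samples)
signal = np.sin(2 * np.pi * t / true_period)

# The strongest non-zero FFT bin gives the dominant frequency in cycles per instance.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n_samples, d=1.0)
dominant = freqs[1:][np.argmax(spectrum[1:])]

period_samples = 1.0 / dominant
period_seconds = period_samples / sampling_rate_hz
n_cycles = n_samples / period_samples

print("period (instances):", period_samples)
print("period (seconds):", round(float(period_seconds), 3))
print("cycles in series:", n_cycles)
```

On real data the FFT peak is less sharp than here, so it is best read together with the line plot and the autocorrelation plot.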
sm.graphics.tsa.plot_acf(data['w'], lags=100)
plt.show()
px.line(y=data['x'], template='presentation', title='Plot for Column X')
Looking at the graph, we cannot tell whether column X has any periodicity
px.line(y=data['y'], template='presentation', title='Plot for Column Y')
Looking at the graph, we cannot tell whether column Y has any periodicity
px.line(y=data['z'], template='presentation', title='Plot for Column Z')
Looking at the graph, we cannot tell whether column Z has any periodicity
spearmancorr = data.corr(method='spearman')
spearmancorr
sns.heatmap(spearmancorr,
            xticklabels=spearmancorr.columns,
            yticklabels=spearmancorr.columns,
            cmap='RdBu_r',
            annot=True,
            linewidths=0.5)
Using Spearman correlation, we can say that:
X --> Y has a decent positive relationship
Y --> Z has a decent positive relationship
pearsoncorr = data.corr(method='pearson')
pearsoncorr
sns.heatmap(pearsoncorr,
            xticklabels=pearsoncorr.columns,
            yticklabels=pearsoncorr.columns,
            cmap='RdBu_r',
            annot=True,
            linewidths=0.5)
Using Pearson correlation, we can say that:
X --> Y has a decent positive relationship
Y --> Z has a decent positive relationship
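The two coefficients measure different things: Pearson captures linear association, Spearman captures monotonic association through ranks. A minimal sketch on made-up data (toy and the exponential relationship are assumptions for illustration) shows how they can diverge.

```python
import numpy as np
import pandas as pd

x = np.linspace(0.0, 5.0, 50)
# Monotonic but strongly non-linear relationship.
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("pearson:", round(float(pearson), 3))
print("spearman:", round(float(spearman), 3))
```

When both methods agree on a pair of columns, the relationship is likely to be close to linear. A Spearman value clearly above the Pearson value hints at a monotonic but curved relationship, which the scatter plots can confirm.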
px.scatter(data, x='w', y='x', template='presentation' , title='Scatter Plot W vs X')
px.scatter(data, x='w', y='y' ,template='presentation' ,title='Scatter Plot W vs Y')
px.scatter(data, x='w', y='z' ,template='presentation' ,title='Scatter Plot W vs Z')
px.scatter(data, x='x', y='y' ,template='presentation' ,title='Scatter Plot X vs Y')
px.scatter(data, x='x', y='z' ,template='presentation' ,title='Scatter Plot X vs Z')
px.scatter(data, x='y', y='z' ,template='presentation' ,title='Scatter Plot Y vs Z')
data['x'].autocorr()
There seems to be very weak to no lag-1 autocorrelation (pattern) in column X
data['y'].autocorr()
There seems to be a negative but very weak to negligible lag-1 autocorrelation (pattern) in column Y
data['z'].autocorr()
There seems to be very weak to no lag-1 autocorrelation (pattern) in column Z
data['w'].autocorr()
Column W shows a very strong lag-1 autocorrelation (pattern)
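Series.autocorr() uses lag=1 by default, so the values above only compare each instance with the one right before it. The sketch below, on synthetic series (a sine wave with an assumed period of 100 and random noise with an arbitrary seed), shows how the lag changes the picture.

```python
import numpy as np
import pandas as pd

t = np.arange(500)
periodic = pd.Series(np.sin(2 * np.pi * t / 100))  # assumed period: 100 instances
noise = pd.Series(np.random.default_rng(3).normal(size=500))

lag1_periodic = periodic.autocorr(lag=1)      # neighbouring points are nearly equal
lag50_periodic = periodic.autocorr(lag=50)    # half a period: anti-correlated
lag100_periodic = periodic.autocorr(lag=100)  # one full period: back in phase
lag1_noise = noise.autocorr(lag=1)            # no structure expected

print(lag1_periodic, lag50_periodic, lag100_periodic, lag1_noise)
```

A high lag-1 value alone only says the series changes smoothly. Periodicity shows up as the autocorrelation swinging negative and then returning close to 1 at a lag equal to the period, which is what the autocorrelation plots help to reveal.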
from statsmodels.graphics.tsaplots import plot_acf
col = data.columns
for i in col:
    # One autocorrelation plot per column, labelled by the column name
    plot_acf(data[i])
    print(i)
    plt.show()